For this assignment you should create your own .Rmd file including the questions and code provided on this page.
You should turn in the compiled HTML document generated by your .Rmd file, which should contain all of your code and the requested output for each question.
Please do not include superfluous output, such as printing large data.frames or vectors, in the file you submit. You should comment out any code that prints something that isn’t asked for in the question you’re answering.
The provided code below will load a shape file containing polygon boundaries for all election precincts in Minnesota, as well as the 2016 election results (at the precinct level) for the state of Minnesota. You’ll need to download the election results Excel file here and change the file path in the read_xlsx command to match the location of this Excel file on your PC.
For this question, use leaflet to create a map displaying the difference in proportions of republican vs. democrat presidential votes for each precinct. For example, the first precinct is “Aitkin”, where there were 552 republican (USPRSR) and 358 democrat votes (USPRSDFL) out of the 986 total votes cast in the presidential race (USPRSTOTAL). Your map’s color for this precinct might be determined using the value of 552/986 - 358/986 = 0.197 (you could also choose to subtract in a different order).
To display these results, you should use a color palette created using colorNumeric such that larger margins in favor of democrats are colored darker blues and larger margins in favor of republicans are colored darker reds (you may use the built-in palette “RdBu” for this). Your color scheme should be displayed in a legend.
Finally, you should add labels so that when you hover over a precinct you see its name (Precinct) and its margin (the difference in proportions used to color the map’s precincts).
library(geojsonio) ## Contains the geojson_read function
shapeurl <- "https://www.sos.state.mn.us/media/2791/mn-precincts.json"
precincts <- geojson_read(shapeurl, what = "sp")
library(readxl) ## Contains the read_xlsx function
mnres2016 <- read_xlsx(path = 'C:\\Users\\millerry\\Downloads\\2016-general-federal-state-results-by-precinct-official.xlsx', sheet = "Results")
## Your code for 1 continues below this point
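For orientation, here is a minimal sketch of the leaflet pieces this question asks for. It assumes the vote margin has already been computed and joined into precincts@data as a column named margin (a hypothetical name; you will need to compute and merge it yourself):

```r
library(leaflet)

## Hypothetical sketch: assumes precincts@data contains a numeric column
## "margin" (republican proportion minus democrat proportion, roughly -1 to 1).
## colorNumeric maps numeric values to colors; reversing "RdBu" puts
## positive (republican) margins on the red end and negative on the blue end.
pal <- colorNumeric(palette = "RdBu", domain = c(-1, 1), reverse = TRUE)

leaflet(precincts) %>%
  addPolygons(fillColor = ~pal(margin), fillOpacity = 0.7, weight = 1,
              label = ~paste0(Precinct, ": ", round(margin, 3))) %>%
  addLegend(pal = pal, values = ~margin, title = "Margin")
```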
For reference, your map should look something like the one shown below (similar but not identical is okay if you satisfy the directions given):
Twitter is a popular social media network in which users can send and receive short 140-character messages (“tweets”) on any topic they wish. For this question you will analyze data contained in the file Ghostbusters.txt (which can be read using the code below). This file contains 5000 tweets downloaded from Twitter on July 18, 2016, based on a search of the word “ghostbusters”.
## Because the format of Twitter data is different from what we're used to,
## we'll need the "scan" function to read it into R. "scan" will search for particular
## characters and use them to define each element of the object it returns
data <- scan("https://raw.githubusercontent.com/ds4stats/case-studies/master/twitter-sentiment/Ghostbusters.txt", what = "")
2-A) Using the stringr package, write code to clean these data by removing the Unicode values (strings like <U+00A0>). To do this, you should assume that anything appearing between the characters < and > (including the < and > themselves) should be removed.
# Your code for 2-A here
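As a sketch of the technique (shown on a toy vector, not the real tweets), a regular expression can match an opening <, any characters that aren't >, and the closing >:

```r
library(stringr)

## Toy example: remove anything between < and >, brackets included
tweets  <- c("Great movie<U+00A0>!", "no unicode here")
cleaned <- str_replace_all(tweets, pattern = "<[^>]*>", replacement = "")
cleaned  ## "Great movie!" "no unicode here"
```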
2-B) On Twitter, a user may echo another user’s tweet to share it with their own followers by “retweeting”. In this dataset, all retweets begin with the letters “RT” followed by “@” and the original user’s Twitter name. For this question, write code that stores the retweets in a separate dataset, then use the length function to find the number of tweets in this dataset.
# Your code for 2-B here
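A sketch on toy data: str_detect with the anchor ^ checks whether a string begins with a pattern, and the resulting logical vector can be used to subset:

```r
library(stringr)

## Toy example: retweets begin with "RT @"
tweets   <- c("RT @user1: saw it", "loved it", "RT @user2: meh")
retweets <- tweets[str_detect(tweets, "^RT @")]
length(retweets)  ## 2
```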
2-C) After excluding retweets, write code that counts the number of times “hate” and “hated” appear (of any capitalization), and the number of times “love”, “loved”, and “looved” appear (and all variants with more “o”s, of any capitalization). (Hint: the sum function will count the number of TRUE elements if used on a logical vector)
# Your code for 2-C here
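A sketch on toy data: regex(ignore_case = TRUE) handles capitalization, o+ allows one or more “o”s, and d? makes the trailing “d” optional:

```r
library(stringr)

## Toy example: detect "love"/"loved"/"looved"/... and "hate"/"hated"
tweets <- c("I LOVE it", "Looooved it!", "I hated it", "meh")
n_love <- sum(str_detect(tweets, regex("lo+ved?", ignore_case = TRUE)))
n_hate <- sum(str_detect(tweets, regex("hated?", ignore_case = TRUE)))
n_love  ## 2
n_hate  ## 1
```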
2-D) Twitter hashtags begin with the # symbol followed by a string of characters (you can consider a blank space, or " ", to indicate the end of a hashtag). For this question, use regular expressions and the str_extract_all function to extract all hashtags from these tweets. Then, use ggplot to create a bar chart visualizing the 10 most common hashtags that don’t include “ghostbusters” (of any capitalization).
Hints:
- The unlist function is useful to go from a list of hashtags within each tweet to a vector of just hashtags
- The table and sort functions are useful to count and sort the prevalence of each hashtag
- The ggplot code below will create the desired plot if “top10” is a vector of counts for the top 10 hashtags, with names corresponding to the hashtags themselves (this is how the output of the table function is formatted)

# Your code for 2-D here
# ggplot() + geom_col(aes(x = names(top10), y = top10)) +
# theme(axis.text.x = element_text(angle = 90, hjust = 1)) ## Rotate the text for better appearance
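A sketch of the extraction step on toy data: str_extract_all returns a list (one element per tweet), which unlist flattens, and table then counts:

```r
library(stringr)

## Toy example: a hashtag is "#" followed by non-space characters
tweets <- c("so good #Ghostbusters #fun", "opening night #fun")
tags   <- unlist(str_extract_all(tweets, "#[^ ]+"))
sort(table(tags), decreasing = TRUE)  ## "#fun" appears twice
```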
Remark: To download tweets from Twitter, you need to have a Twitter account and then sign into the developer page. Analyzing Twitter data makes for a potentially interesting final project. Details on the authentication procedure can be found at this link: http://thinktostart.com/twitter-authentification-with-r/
This question will review concepts from the variable transformations presentation. It will involve a dataset documenting over 500,000 transactions from a UK-based retailer occurring between 1/12/2010 and 9/12/2011. From this dataset, we’re interested in purchasing volume, or how many total units of product a customer ordered, and purchasing frequency, or how many different instances a customer made an order.
retail <- read.csv("https://remiller1450.github.io/data/OnlineRetail.csv", stringsAsFactors = FALSE)
3-A)
In the code chunk below, use the group_by and summarize functions to create variables describing the volume and frequency of each customer. You should define “volume” as the sum of the “Quantity” variable. You should define “frequency” as the number of unique invoices made by each customer. (Hint: the command length(unique(retail$InvoiceNo)) would provide this for the full dataset; you should adapt that code for use within the summarize function)
# Your code for 3-A here
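A sketch of the group_by/summarize pattern on a toy data.frame with the same structure (CustomerID, InvoiceNo, Quantity) as the retail data:

```r
library(dplyr)

## Toy data: customer 1 made two invoices (A and B), customer 2 made one
toy <- data.frame(CustomerID = c(1, 1, 2),
                  InvoiceNo  = c("A", "B", "C"),
                  Quantity   = c(2, 3, 5))

summaries <- toy %>%
  group_by(CustomerID) %>%
  summarize(volume    = sum(Quantity),
            frequency = length(unique(InvoiceNo)))
summaries
```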
type your answer to 3-A here
3-B)
Use ggplot to construct a scatterplot displaying “frequency” on the x-axis and “volume” on the y-axis. You should use the default plotting limits (ie: don’t change the x-axis or y-axis scales). Is it possible to discern how these variables are related from this plot?
# Your code for 3-B here
type your answer to 3-B here
3-C)
Apply a log base-10 transformation to each variable using the mutate function. Then, recreate the plot in part B using your newly transformed variables. Is it possible to discern how these variables are related from this plot?
# Your code for 3-C here
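A sketch of the transformation step on toy data: mutate adds log base-10 versions of each variable, which then feed into the same ggplot code as before:

```r
library(dplyr)
library(ggplot2)

## Toy data standing in for the customer summaries from 3-A
toy <- data.frame(frequency = c(1, 10, 100),
                  volume    = c(120, 1500, 14000))
toy <- toy %>% mutate(log_freq = log10(frequency),
                      log_vol  = log10(volume))
ggplot(toy) + geom_point(aes(x = log_freq, y = log_vol))
```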
type your answer to 3-C here
3-D)
A fitted linear regression model for the transformed data looks like:
\[\log_{10}(\text{volume}) = 2.09 + 1.07 \cdot \log_{10}(\text{frequency})\]
How do you interpret the slope coefficient of this model? (be specific)
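As a hint (not the full answer), recall what a slope means on the log base-10 scale: exponentiating both sides of the fitted model gives

\[\text{volume} \approx 10^{2.09} \cdot \text{frequency}^{1.07},\]

so multiplying frequency by 10 multiplies the predicted volume by \(10^{1.07} \approx 11.7\).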
type your answer to 3-D here
AmesHousing <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/data-viz/data/AmesHousing.csv")
library(dplyr) ## Contains the select function
## Select numeric variables of interest
Ames_Subset <- select(AmesHousing, LotFrontage, LotArea, OverallQual, OverallCond, BsmtFinSF1, BsmtUnfSF, GrLivArea, TotRmsAbvGrd, FullBath, GarageArea, WoodDeckSF, ScreenPorch, SalePrice)
## Remove homes w/ missing data
Ames_Subset <- na.omit(Ames_Subset)
## Separate off Sale Price from home attributes
Ames_Subset_Prices <- Ames_Subset$SalePrice
Ames_Subset <- select(Ames_Subset, -SalePrice)
4-A)
After standardizing the data.frame Ames_Subset, perform PCA and visualize the first two principal components using the fviz_pca_biplot function (from the factoextra package). Based upon what you see, argue whether the first principal component scores can be described as each home’s “overall size”.
# Your code for 4-A here
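A minimal sketch of the two steps involved (pca_fit is just an illustrative object name):

```r
library(factoextra)

## Sketch, assuming Ames_Subset as created above. scale. = TRUE standardizes
## each variable before PCA; fviz_pca_biplot overlays the variable loadings
## on the first two principal component scores.
pca_fit <- prcomp(Ames_Subset, scale. = TRUE)
fviz_pca_biplot(pca_fit)
```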
type your answer to 4-A here
4-B)
Using your output from Question 4-A, provide an interpretation for the second principal component.
type your answer to 4-B here
4-C)
Construct a scree plot using fviz_screeplot. Based upon this plot, how many principal components do you think should be used to accurately represent the important features of these data?
# Your code for 4-C here
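A one-line sketch, assuming your PCA result from Question 4-A is stored in an object named pca_fit (an illustrative name):

```r
library(factoextra)

## addlabels = TRUE prints the percentage of variance explained on each bar
fviz_screeplot(pca_fit, addlabels = TRUE)
```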
type your answer to 4-C here
4-D)
Color the homes in the biplot you constructed in Part 4-A using the vector Ames_Subset_Prices and the argument “col.ind”. Based upon your new plot, do the first two principal components seem to be related to sale price? If so, how?
# Your code for 4-D here
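A sketch, again assuming your PCA result from Question 4-A is stored in an object named pca_fit (an illustrative name):

```r
library(factoextra)

## col.ind colors each home (individual) by its sale price
fviz_pca_biplot(pca_fit, col.ind = Ames_Subset_Prices)
```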
type your answer to 4-D here